[Internal] DTS: Adds retries in DTS when isRetriable is true and on timeout#5689
[Internal] DTS: Adds retries in DTS when isRetriable is true and on timeout#5689Meghana-Palaparthi wants to merge 25 commits intomainfrom
Conversation
|
@sdkReviewAgent |
|
sdkReviewAgent | Status: ⏳ Queued Review requested by @xinlian12. I'll start shortly. |
|
sdkReviewAgent | Status: 🔍 Reviewing I'm reviewing this PR now. I'll post my findings as comments when done. |
# Conflicts: # Microsoft.Azure.Cosmos/src/DistributedTransaction/DistributedTransactionCommitter.cs # Microsoft.Azure.Cosmos/src/DistributedTransaction/DistributedTransactionOperationResult.cs # Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/DistributedTransaction/DistributedTransactionCommitterTests.cs
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
✅ Review complete (40:04) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
✅ Review complete (37:13) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
Refactor DTX retry policy into ClientRetryPolicy
- Add DTX-specific retry block in ClientRetryPolicy.ShouldRetryInternalAsync gated on IsDistributedTransactionRequest, covering:
* 449/5352 (DtcCoordinatorRaceConflict): honors Retry-After header
* 500/5411,5412,5413 (DtcLedgerFailure, DtcAccountConfigFailure, DtcDispatchFailure): transient infra failures safe to retry via idempotency token.
- Uses dedicated dtxRetryCount field (max 10) to avoid the single-master write restrictions in ShouldRetryOnUnavailableEndpointStatusCodes.
- 408 and 403.3 were already handled by the pipeline for all request types.
- 429/3200 (DtcLedgerThrottled) was already handled by ResourceThrottleRetryPolicy.
- Simplify DistributedTransactionCommitter.ExecuteCommitWithRetryAsync to only loop on the isRetriable JSON body flag (the one signal not visible at the pipeline level). Remove IsRetriableByStatusCode and the CosmosException-when-408 catch block.
- Update ClientRetryPolicyTests with DTX-specific tests:
* 408, 449/5352, 500/5411-5413 retry correctly for DTX requests
* Non-DTX requests are NOT retried on the same codes (gating tests)
* Budget exhaustion (10 retries) returns NoRetry
- Update DistributedTransactionCommitterTests: remove all status-code retry tests (now covered in ClientRetryPolicyTests), rename tests that assumed CosmosException was caught by the outer loop.
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
|
Azure Pipelines: Successfully started running 1 pipeline(s). |
| { | ||
| private static readonly TimeSpan DefaultRetryBaseDelay = TimeSpan.FromSeconds(1); | ||
|
|
||
| internal const int MaxIsRetriableRetryCount = 100; |
There was a problem hiding this comment.
I am curious, if this retries 100 times, and inside ClientRetryPolicy we retry 100 times, wouldn't it be like 100k retries for a single request?
To avoid this confusion, maybe we can combine these two retry mechanisms into a single layer?
There was a problem hiding this comment.
Good catch, I think worst this is 10k retries on a single request (before throttling) . If we want to combine maybe a passing response header to the client retry policy could work? If we wanted to keep the 2 retry layers and not have a new header we can probably just limit it to 3-5 retries here to be safe.
Description
This pull request introduces robust retry logic for distributed transaction commits and improves the handling and parsing of distributed transaction responses in the Cosmos DB SDK. The changes ensure that commit operations are retried safely in the event of timeouts or retriable errors, enhance diagnostics, and make response parsing more resilient. Additionally, the request and response classes are refactored for safer stream handling and improved reliability.
Reference doc for DTX retries: dtx-sdk-response-status-codes.md
Distributed transaction commit improvements:
DistributedTransactionCommitterResponse parsing and diagnostics enhancements:
isRetriableandserverDiagnosticsfields, and improved resilience to partial JSON parsing failures.IsRetriableproperty toDistributedTransactionResponseand ensured it is correctly populated from server responses.Request stream handling improvements:
DistributedTransactionServerRequestto use a pre-serialized byte array for the request body, enabling safe creation of new memory streams for each retry and preventing disposal issues.Reliability and correctness fixes:
DistributedTransactionResponse.DistributedTransactionOperationResultto throw explicit exceptions on failure.Miscellaneous:
Type of change
Please delete options that are not relevant.
Closing issues
To automatically close an issue: closes #IssueNumber